Assignment_1PM566

Author

Mona Bandov

Published

September 29, 2023

Assignment 1 (Due September 29, 2023 11:59pm)

Loading Libraries and Data:

library(readr)
library(magrittr)
library(devtools)
Loading required package: usethis
library(leaflet)
library(knitr)
library(leaflet)
library(tidyr)

Attaching package: 'tidyr'
The following object is masked from 'package:magrittr':

    extract
library(datasets)
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
library(ggplot2)
PM2022 <- read_csv("~/Desktop/PM566/Assigment_1/2022_PM2.5.csv")
Rows: 57761 Columns: 20
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (11): Date, Source, Site ID, UNITS, Site Name, AQS_PARAMETER_DESC, CBSA_...
dbl  (9): POC, Daily Mean PM2.5 Concentration, DAILY_AQI_VALUE, DAILY_OBS_CO...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
PM2002 <- read_csv("~/Desktop/PM566/Assigment_1/2002_PM2.5.csv")
Rows: 15976 Columns: 20
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (11): Date, Source, Site ID, UNITS, Site Name, AQS_PARAMETER_DESC, CBSA_...
dbl  (9): POC, Daily Mean PM2.5 Concentration, DAILY_AQI_VALUE, DAILY_OBS_CO...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
  1. Given the formulated question from the assignment description, you will now conduct EDA Checklist items 2-4. First, download 2002 and 2022 data for all sites in California from the EPA Air Quality Data website. Read in the data using data.table(). For each of the two datasets, check the dimensions, headers, footers, variable names and variable types. Check for any data issues, particularly in the key variable we are analyzing. Make sure you write up a summary of all of your findings.

2020 Data Set Exploration:

dim(PM2022)
[1] 57761    20

There are 57761 rows and 20 columns

head(PM2022)
# A tibble: 6 × 20
  Date       Source `Site ID`   POC Daily Mean PM2.5 Con…¹ UNITS DAILY_AQI_VALUE
  <chr>      <chr>  <chr>     <dbl>                  <dbl> <chr>           <dbl>
1 01/01/2022 AQS    060010007     3                   12.7 ug/m…              52
2 01/02/2022 AQS    060010007     3                   13.9 ug/m…              55
3 01/03/2022 AQS    060010007     3                    7.1 ug/m…              30
4 01/04/2022 AQS    060010007     3                    3.7 ug/m…              15
5 01/05/2022 AQS    060010007     3                    4.2 ug/m…              18
6 01/06/2022 AQS    060010007     3                    3.8 ug/m…              16
# ℹ abbreviated name: ¹​`Daily Mean PM2.5 Concentration`
# ℹ 13 more variables: `Site Name` <chr>, DAILY_OBS_COUNT <dbl>,
#   PERCENT_COMPLETE <dbl>, AQS_PARAMETER_CODE <dbl>, AQS_PARAMETER_DESC <chr>,
#   CBSA_CODE <dbl>, CBSA_NAME <chr>, STATE_CODE <chr>, STATE <chr>,
#   COUNTY_CODE <chr>, COUNTY <chr>, SITE_LATITUDE <dbl>, SITE_LONGITUDE <dbl>
head(PM2022)
# A tibble: 6 × 20
  Date       Source `Site ID`   POC Daily Mean PM2.5 Con…¹ UNITS DAILY_AQI_VALUE
  <chr>      <chr>  <chr>     <dbl>                  <dbl> <chr>           <dbl>
1 01/01/2022 AQS    060010007     3                   12.7 ug/m…              52
2 01/02/2022 AQS    060010007     3                   13.9 ug/m…              55
3 01/03/2022 AQS    060010007     3                    7.1 ug/m…              30
4 01/04/2022 AQS    060010007     3                    3.7 ug/m…              15
5 01/05/2022 AQS    060010007     3                    4.2 ug/m…              18
6 01/06/2022 AQS    060010007     3                    3.8 ug/m…              16
# ℹ abbreviated name: ¹​`Daily Mean PM2.5 Concentration`
# ℹ 13 more variables: `Site Name` <chr>, DAILY_OBS_COUNT <dbl>,
#   PERCENT_COMPLETE <dbl>, AQS_PARAMETER_CODE <dbl>, AQS_PARAMETER_DESC <chr>,
#   CBSA_CODE <dbl>, CBSA_NAME <chr>, STATE_CODE <chr>, STATE <chr>,
#   COUNTY_CODE <chr>, COUNTY <chr>, SITE_LATITUDE <dbl>, SITE_LONGITUDE <dbl>
colnames(PM2022)
 [1] "Date"                           "Source"                        
 [3] "Site ID"                        "POC"                           
 [5] "Daily Mean PM2.5 Concentration" "UNITS"                         
 [7] "DAILY_AQI_VALUE"                "Site Name"                     
 [9] "DAILY_OBS_COUNT"                "PERCENT_COMPLETE"              
[11] "AQS_PARAMETER_CODE"             "AQS_PARAMETER_DESC"            
[13] "CBSA_CODE"                      "CBSA_NAME"                     
[15] "STATE_CODE"                     "STATE"                         
[17] "COUNTY_CODE"                    "COUNTY"                        
[19] "SITE_LATITUDE"                  "SITE_LONGITUDE"                

We have 20 column names: “Date”,“Source”, “Site ID”,“POC”, “Daily Mean PM2.5 Concentration”,“UNITS”,“DAILY_AQI_VALUE”,“Site Name” ,“DAILY_OBS_COUNT”, “PERCENT_COMPLETE”, “AQS_PARAMETER_CODE”, “AQS_PARAMETER_DESC”, “CBSA_CODE”, “CBSA_NAME”, “STATE_CODE”, “STATE”, “COUNTY_CODE”, “COUNTY”, “SITE_LATITUDE”, “SITE_LONGITUDE” (Not in this order)

We see that the column names are not in the best form to filter/manipulate the data. I will edit the column names so that it is easier to compute the desired information.

summary(PM2022)
     Date              Source            Site ID               POC        
 Length:57761       Length:57761       Length:57761       Min.   : 1.000  
 Class :character   Class :character   Class :character   1st Qu.: 1.000  
 Mode  :character   Mode  :character   Mode  :character   Median : 3.000  
                                                          Mean   : 2.531  
                                                          3rd Qu.: 3.000  
                                                          Max.   :21.000  
                                                                          
 Daily Mean PM2.5 Concentration    UNITS           DAILY_AQI_VALUE 
 Min.   : -2.200                Length:57761       Min.   :  0.00  
 1st Qu.:  4.200                Class :character   1st Qu.: 18.00  
 Median :  7.000                Mode  :character   Median : 29.00  
 Mean   :  8.564                                   Mean   : 32.94  
 3rd Qu.: 10.900                                   3rd Qu.: 45.00  
 Max.   :302.500                                   Max.   :353.00  
                                                                   
  Site Name         DAILY_OBS_COUNT PERCENT_COMPLETE AQS_PARAMETER_CODE
 Length:57761       Min.   :1       Min.   :100      Min.   :88101     
 Class :character   1st Qu.:1       1st Qu.:100      1st Qu.:88101     
 Mode  :character   Median :1       Median :100      Median :88101     
                    Mean   :1       Mean   :100      Mean   :88196     
                    3rd Qu.:1       3rd Qu.:100      3rd Qu.:88101     
                    Max.   :1       Max.   :100      Max.   :88502     
                                                                       
 AQS_PARAMETER_DESC   CBSA_CODE      CBSA_NAME          STATE_CODE       
 Length:57761       Min.   :12540   Length:57761       Length:57761      
 Class :character   1st Qu.:31080   Class :character   Class :character  
 Mode  :character   Median :40140   Mode  :character   Mode  :character  
                    Mean   :35445                                        
                    3rd Qu.:41860                                        
                    Max.   :49700                                        
                    NA's   :4761                                         
    STATE           COUNTY_CODE           COUNTY          SITE_LATITUDE  
 Length:57761       Length:57761       Length:57761       Min.   :32.58  
 Class :character   Class :character   Class :character   1st Qu.:34.14  
 Mode  :character   Mode  :character   Mode  :character   Median :36.60  
                                                          Mean   :36.37  
                                                          3rd Qu.:38.10  
                                                          Max.   :41.76  
                                                                         
 SITE_LONGITUDE  
 Min.   :-124.2  
 1st Qu.:-121.5  
 Median :-119.8  
 Mean   :-119.7  
 3rd Qu.:-118.1  
 Max.   :-115.5  
                 

After looking at this, I saw something weird about the data. Daily Mean PM2.5 Concentration’s minimum is -2.200. It doesn’t make sense that the concentration is less than 0. I will explore this later.

2002 Data Set Exploration:

dim(PM2002)
[1] 15976    20

There are 15976 rows and 20 columns.

head(PM2002)
# A tibble: 6 × 20
  Date       Source `Site ID`   POC Daily Mean PM2.5 Con…¹ UNITS DAILY_AQI_VALUE
  <chr>      <chr>  <chr>     <dbl>                  <dbl> <chr>           <dbl>
1 01/05/2002 AQS    060010007     1                   25.1 ug/m…              78
2 01/06/2002 AQS    060010007     1                   31.6 ug/m…              92
3 01/08/2002 AQS    060010007     1                   21.4 ug/m…              71
4 01/11/2002 AQS    060010007     1                   25.9 ug/m…              80
5 01/14/2002 AQS    060010007     1                   34.5 ug/m…              98
6 01/17/2002 AQS    060010007     1                   41   ug/m…             115
# ℹ abbreviated name: ¹​`Daily Mean PM2.5 Concentration`
# ℹ 13 more variables: `Site Name` <chr>, DAILY_OBS_COUNT <dbl>,
#   PERCENT_COMPLETE <dbl>, AQS_PARAMETER_CODE <dbl>, AQS_PARAMETER_DESC <chr>,
#   CBSA_CODE <dbl>, CBSA_NAME <chr>, STATE_CODE <chr>, STATE <chr>,
#   COUNTY_CODE <chr>, COUNTY <chr>, SITE_LATITUDE <dbl>, SITE_LONGITUDE <dbl>
tail(PM2002)
# A tibble: 6 × 20
  Date       Source `Site ID`   POC Daily Mean PM2.5 Con…¹ UNITS DAILY_AQI_VALUE
  <chr>      <chr>  <chr>     <dbl>                  <dbl> <chr>           <dbl>
1 12/10/2002 AQS    061131003     1                     15 ug/m…              57
2 12/13/2002 AQS    061131003     1                     15 ug/m…              57
3 12/22/2002 AQS    061131003     1                      1 ug/m…               4
4 12/25/2002 AQS    061131003     1                     23 ug/m…              74
5 12/28/2002 AQS    061131003     1                      5 ug/m…              21
6 12/31/2002 AQS    061131003     1                      6 ug/m…              25
# ℹ abbreviated name: ¹​`Daily Mean PM2.5 Concentration`
# ℹ 13 more variables: `Site Name` <chr>, DAILY_OBS_COUNT <dbl>,
#   PERCENT_COMPLETE <dbl>, AQS_PARAMETER_CODE <dbl>, AQS_PARAMETER_DESC <chr>,
#   CBSA_CODE <dbl>, CBSA_NAME <chr>, STATE_CODE <chr>, STATE <chr>,
#   COUNTY_CODE <chr>, COUNTY <chr>, SITE_LATITUDE <dbl>, SITE_LONGITUDE <dbl>
colnames(PM2002)
 [1] "Date"                           "Source"                        
 [3] "Site ID"                        "POC"                           
 [5] "Daily Mean PM2.5 Concentration" "UNITS"                         
 [7] "DAILY_AQI_VALUE"                "Site Name"                     
 [9] "DAILY_OBS_COUNT"                "PERCENT_COMPLETE"              
[11] "AQS_PARAMETER_CODE"             "AQS_PARAMETER_DESC"            
[13] "CBSA_CODE"                      "CBSA_NAME"                     
[15] "STATE_CODE"                     "STATE"                         
[17] "COUNTY_CODE"                    "COUNTY"                        
[19] "SITE_LATITUDE"                  "SITE_LONGITUDE"                

The column names are the same as 2020 data.

summary(PM2002)
     Date              Source            Site ID               POC       
 Length:15976       Length:15976       Length:15976       Min.   :1.000  
 Class :character   Class :character   Class :character   1st Qu.:1.000  
 Mode  :character   Mode  :character   Mode  :character   Median :1.000  
                                                          Mean   :1.581  
                                                          3rd Qu.:1.000  
                                                          Max.   :6.000  
                                                                         
 Daily Mean PM2.5 Concentration    UNITS           DAILY_AQI_VALUE 
 Min.   :  0.00                 Length:15976       Min.   :  0.00  
 1st Qu.:  7.00                 Class :character   1st Qu.: 29.00  
 Median : 12.00                 Mode  :character   Median : 50.00  
 Mean   : 16.12                                    Mean   : 53.68  
 3rd Qu.: 20.50                                    3rd Qu.: 69.00  
 Max.   :104.30                                    Max.   :176.00  
                                                                   
  Site Name         DAILY_OBS_COUNT PERCENT_COMPLETE AQS_PARAMETER_CODE
 Length:15976       Min.   :1       Min.   :100      Min.   :88101     
 Class :character   1st Qu.:1       1st Qu.:100      1st Qu.:88101     
 Mode  :character   Median :1       Median :100      Median :88101     
                    Mean   :1       Mean   :100      Mean   :88215     
                    3rd Qu.:1       3rd Qu.:100      3rd Qu.:88502     
                    Max.   :1       Max.   :100      Max.   :88502     
                                                                       
 AQS_PARAMETER_DESC   CBSA_CODE      CBSA_NAME          STATE_CODE       
 Length:15976       Min.   :12540   Length:15976       Length:15976      
 Class :character   1st Qu.:23420   Class :character   Class :character  
 Mode  :character   Median :40140   Mode  :character   Mode  :character  
                    Mean   :33270                                        
                    3rd Qu.:41740                                        
                    Max.   :49700                                        
                    NA's   :929                                          
    STATE           COUNTY_CODE           COUNTY          SITE_LATITUDE  
 Length:15976       Length:15976       Length:15976       Min.   :32.63  
 Class :character   Class :character   Class :character   1st Qu.:34.07  
 Mode  :character   Mode  :character   Mode  :character   Median :35.36  
                                                          Mean   :36.00  
                                                          3rd Qu.:37.77  
                                                          Max.   :41.71  
                                                                         
 SITE_LONGITUDE  
 Min.   :-124.2  
 1st Qu.:-121.4  
 Median :-119.1  
 Mean   :-119.4  
 3rd Qu.:-117.9  
 Max.   :-115.5  
                 

Unlike 2020 data, there minimum for Daily Mean PM2.5 concentration is 0, but the max is 104.3. The max of the 202 data was 302.500. We may have to look into it.

  1. Combine the two years of data into one data frame. Use the Date variable to create a new column for year, which will serve as an identifier. Change the names of the key variables so that they are easier to refer to in your code.
bothyears <- rbind(PM2002, PM2022)
#Separated Dates

bothyears1 <- data.frame(date = bothyears$Date)
year_dataframe<-separate(bothyears1 , "date", c("Month", "Day", "Year"), sep = "/")

#Combined the data in which the year column has a long name
semifinal_dataset<-cbind(bothyears,year_dataframe$Year,year_dataframe$Month)

#Change all the column names in the data frame

final_dataset <-semifinal_dataset %>% rename("YEAR"= "year_dataframe$Year","Daily_Mean_PM2.5_Concentration"= "Daily Mean PM2.5 Concentration","Site_Name"= "Site Name" , "Month"= "year_dataframe$Month" )
head(final_dataset)
        Date Source   Site ID POC Daily_Mean_PM2.5_Concentration    UNITS
1 01/05/2002    AQS 060010007   1                           25.1 ug/m3 LC
2 01/06/2002    AQS 060010007   1                           31.6 ug/m3 LC
3 01/08/2002    AQS 060010007   1                           21.4 ug/m3 LC
4 01/11/2002    AQS 060010007   1                           25.9 ug/m3 LC
5 01/14/2002    AQS 060010007   1                           34.5 ug/m3 LC
6 01/17/2002    AQS 060010007   1                           41.0 ug/m3 LC
  DAILY_AQI_VALUE Site_Name DAILY_OBS_COUNT PERCENT_COMPLETE AQS_PARAMETER_CODE
1              78 Livermore               1              100              88101
2              92 Livermore               1              100              88101
3              71 Livermore               1              100              88101
4              80 Livermore               1              100              88101
5              98 Livermore               1              100              88101
6             115 Livermore               1              100              88101
        AQS_PARAMETER_DESC CBSA_CODE                         CBSA_NAME
1 PM2.5 - Local Conditions     41860 San Francisco-Oakland-Hayward, CA
2 PM2.5 - Local Conditions     41860 San Francisco-Oakland-Hayward, CA
3 PM2.5 - Local Conditions     41860 San Francisco-Oakland-Hayward, CA
4 PM2.5 - Local Conditions     41860 San Francisco-Oakland-Hayward, CA
5 PM2.5 - Local Conditions     41860 San Francisco-Oakland-Hayward, CA
6 PM2.5 - Local Conditions     41860 San Francisco-Oakland-Hayward, CA
  STATE_CODE      STATE COUNTY_CODE  COUNTY SITE_LATITUDE SITE_LONGITUDE YEAR
1         06 California         001 Alameda      37.68753      -121.7842 2002
2         06 California         001 Alameda      37.68753      -121.7842 2002
3         06 California         001 Alameda      37.68753      -121.7842 2002
4         06 California         001 Alameda      37.68753      -121.7842 2002
5         06 California         001 Alameda      37.68753      -121.7842 2002
6         06 California         001 Alameda      37.68753      -121.7842 2002
  Month
1    01
2    01
3    01
4    01
5    01
6    01
  1. Create a basic map in leaflet() that shows the locations of the sites (make sure to use different colors for each year). Summarize the spatial distribution of the monitoring sites.
#Color Pallete 
color_palette <- colorFactor(
  palette = c( "green", "orange"),  # Specify colors for each year
  domain = final_dataset$YEAR  # Use the YEAR column from your dataset
)
leaflet(final_dataset) %>%
  addProviderTiles('OpenStreetMap') %>% 
  addCircles(
    lat = ~SITE_LATITUDE,
    lng = ~SITE_LONGITUDE,
    opacity = 1,
    fillOpacity = 1,
    radius = 100,
    color = ~color_palette(YEAR),  # Color circles based on the YEAR column
    popup = ~Site_Name  # Display SITE_NAME in the popup
    )%>%
  addLegend(
    pal = color_palette,
    values = ~YEAR,
    title = "Year",
    position = "bottomright")

The dots are suppose to represent each (data station) data point, but in the exploratory data analysis, we saw that there was more data 2002 and 2020 data set, but that is not reflective on the map. We would expect more dots for 2022, and even more for 2002.The data is limited to just California.

  1. Check for any missing or implausible values of PM2.5 in the combined dataset. Explore the proportions of each and provide a summary of any temporal patterns you see in these observations.
#Finding the missing values: There were no missing vlaues 
missing_pm <- final_dataset %>%filter(is.na(Daily_Mean_PM2.5_Concentration))
max(final_dataset$Daily_Mean_PM2.5_Concentration)
[1] 302.5
min(final_dataset$Daily_Mean_PM2.5_Concentration)
[1] -2.2
#I filted out to see how many negative values there are and other information. 
negative_PM2.5 <- final_dataset %>%filter(Daily_Mean_PM2.5_Concentration<0)
dim(negative_PM2.5) #218 values that are less than 0
[1] 218  22
head(negative_PM2.5)
        Date Source   Site ID POC Daily_Mean_PM2.5_Concentration    UNITS
1 07/06/2022    AQS 060010011   3                           -0.7 ug/m3 LC
2 07/30/2022    AQS 060010011   3                           -0.1 ug/m3 LC
3 08/26/2022    AQS 060010011   3                           -0.5 ug/m3 LC
4 02/01/2022    AQS 060072002   3                           -0.3 ug/m3 LC
5 02/06/2022    AQS 060072002   3                           -0.1 ug/m3 LC
6 02/22/2022    AQS 060072002   3                           -0.3 ug/m3 LC
  DAILY_AQI_VALUE          Site_Name DAILY_OBS_COUNT PERCENT_COMPLETE
1               0       Oakland West               1              100
2               0       Oakland West               1              100
3               0       Oakland West               1              100
4               0 Paradise - Theater               1              100
5               0 Paradise - Theater               1              100
6               0 Paradise - Theater               1              100
  AQS_PARAMETER_CODE                     AQS_PARAMETER_DESC CBSA_CODE
1              88101               PM2.5 - Local Conditions     41860
2              88101               PM2.5 - Local Conditions     41860
3              88101               PM2.5 - Local Conditions     41860
4              88502 Acceptable PM2.5 AQI & Speciation Mass     17020
5              88502 Acceptable PM2.5 AQI & Speciation Mass     17020
6              88502 Acceptable PM2.5 AQI & Speciation Mass     17020
                          CBSA_NAME STATE_CODE      STATE COUNTY_CODE  COUNTY
1 San Francisco-Oakland-Hayward, CA         06 California         001 Alameda
2 San Francisco-Oakland-Hayward, CA         06 California         001 Alameda
3 San Francisco-Oakland-Hayward, CA         06 California         001 Alameda
4                         Chico, CA         06 California         007   Butte
5                         Chico, CA         06 California         007   Butte
6                         Chico, CA         06 California         007   Butte
  SITE_LATITUDE SITE_LONGITUDE YEAR Month
1      37.81478      -122.2823 2022    07
2      37.81478      -122.2823 2022    07
3      37.81478      -122.2823 2022    08
4      39.77919      -121.5914 2022    02
5      39.77919      -121.5914 2022    02
6      39.77919      -121.5914 2022    02
# I aggregated county data to find the summary statistic
final_dataset1 <- final_dataset %>%
  group_by(COUNTY)%>% summarize(
    Mean_Variable = mean(Daily_Mean_PM2.5_Concentration),
    Median_Variable = median(Daily_Mean_PM2.5_Concentration),
    Min_Var= min(Daily_Mean_PM2.5_Concentration),
    Max_Var= max(Daily_Mean_PM2.5_Concentration),
    Count = n()
  )
head(final_dataset1)
# A tibble: 6 × 6
  COUNTY       Mean_Variable Median_Variable Min_Var Max_Var Count
  <chr>                <dbl>           <dbl>   <dbl>   <dbl> <int>
1 Alameda               8.81            7.2     -0.7    61.6  1994
2 Butte                 8.71            6       -0.6    88    1599
3 Calaveras             6.60            5.3      0      40     415
4 Colusa                8.40            7        0.6    57     496
5 Contra Costa          9.95            7.8      0.3    76.7  1097
6 Del Norte             4.75            4.05    -0.8    21.2   568
#I aggregated  site data to find the summary statistic
final_dataset2 <- final_dataset %>%
  group_by(Site_Name)%>% summarize(
    Mean_Variable = mean(Daily_Mean_PM2.5_Concentration),
    Median_Variable = median(Daily_Mean_PM2.5_Concentration),
    Min_Var= min(Daily_Mean_PM2.5_Concentration),
    Max_Var= max(Daily_Mean_PM2.5_Concentration),
    Count = n()
  )
head(final_dataset2)
# A tibble: 6 × 6
  Site_Name               Mean_Variable Median_Variable Min_Var Max_Var Count
  <chr>                           <dbl>           <dbl>   <dbl>   <dbl> <int>
1 29 Palms                        10.4             8.3      2.4   110.    414
2 3425 N FIRST ST, FRESNO         22.2            14        2      92.5   499
3 Alpine                           6.29            5.8     -0.3    18.7   382
4 Alturas-Fourth St.               5               5        5       5       2
5 Anaheim                         12.3            10.2     -0.4    68.6  1125
6 Aqua Tibia Wilderness            6.88            5.95     0.6    19.2   168

There are no N/A values in Daily Mean PM2.5 Concentration. The minimum is of this column is -2.2 and the maximum is 302.5. Like mentioned before, it is impossible to have a concentration level less than 0. The value 302.5 makes sense; it would be considered hazardous. In this data set, we have to keep acount that 218 rows in the data set that have a Daily Mean PM2.5 Concentration are less than 0. I then computed the mean, median,min, count and max for each county in finaldataset1. In finaldaraset2, I computed the same information except it is for each site in the data set.

  1. Explore the main question of interest at three different spatial levels. Create exploratory plots (e.g. boxplots, histograms, line plots) and summary statistics that best suit each level of data. Be sure to write up explanations of what you observe in these data.
    • state

    • county

    • site in Los Angeles

STATE ANALYSIS: (Original Data Since I Filtered It

#Box Plot 
ggplot(final_dataset, aes(x = YEAR, y =Daily_Mean_PM2.5_Concentration )) +
  geom_boxplot() +
  labs(title = "Box Plot of Daily Mean PM2.5 Concentration by Year in California", x = "Year", y = "PM2.5")

#Qunatile Numbers: 
quantile_data <- final_dataset %>%
  group_by(YEAR) %>%
  summarize(
    Q25 = quantile(Daily_Mean_PM2.5_Concentration, 0.25),
    Median = quantile(Daily_Mean_PM2.5_Concentration, 0.5),
    Q75 = quantile(Daily_Mean_PM2.5_Concentration, 0.75))

(quantile_data)
# A tibble: 2 × 4
  YEAR    Q25 Median   Q75
  <chr> <dbl>  <dbl> <dbl>
1 2002    7       12  20.5
2 2022    4.2      7  10.9

Here, I created the box and whiskers plot to assess the difference in Daily Mean PM2.5 Concentration between two underlying groups (2002 versus 2022). I then tabled the IQR. I can see there are more outliers in 2022 then there is in 2002. There is more of a spread in 2002 compared to 2022.

#Histogram of the same data 
ggplot(final_dataset, aes(x = log(Daily_Mean_PM2.5_Concentration), fill=YEAR)) +
  geom_histogram(binwidth = 0.50) +
  labs(title = "Histogram of PM2.5 Values in California by Year",
       x = "PM2.5",
       y = "Frequency")
Warning in log(Daily_Mean_PM2.5_Concentration): NaNs produced

Warning in log(Daily_Mean_PM2.5_Concentration): NaNs produced
Warning: Removed 338 rows containing non-finite values (`stat_bin()`).

In addition to the box and whiskers plot, I made a histogram to show the same information, but in a distribution, I decided to take the log of the x values to transform the data to see the difference in a more digestible way. We can see that the data in 2002 is more spreaded out.

County Data:

final_dataset1 <- final_dataset %>%
  group_by(COUNTY)%>% summarize(
    Mean_Variable = mean(Daily_Mean_PM2.5_Concentration),
    Median_Variable = median(Daily_Mean_PM2.5_Concentration),
    Min_Var= min(Daily_Mean_PM2.5_Concentration),
    Max_Var= max(Daily_Mean_PM2.5_Concentration),
    Count = n())
#dim(final_dataset1)
head(final_dataset)
        Date Source   Site ID POC Daily_Mean_PM2.5_Concentration    UNITS
1 01/05/2002    AQS 060010007   1                           25.1 ug/m3 LC
2 01/06/2002    AQS 060010007   1                           31.6 ug/m3 LC
3 01/08/2002    AQS 060010007   1                           21.4 ug/m3 LC
4 01/11/2002    AQS 060010007   1                           25.9 ug/m3 LC
5 01/14/2002    AQS 060010007   1                           34.5 ug/m3 LC
6 01/17/2002    AQS 060010007   1                           41.0 ug/m3 LC
  DAILY_AQI_VALUE Site_Name DAILY_OBS_COUNT PERCENT_COMPLETE AQS_PARAMETER_CODE
1              78 Livermore               1              100              88101
2              92 Livermore               1              100              88101
3              71 Livermore               1              100              88101
4              80 Livermore               1              100              88101
5              98 Livermore               1              100              88101
6             115 Livermore               1              100              88101
        AQS_PARAMETER_DESC CBSA_CODE                         CBSA_NAME
1 PM2.5 - Local Conditions     41860 San Francisco-Oakland-Hayward, CA
2 PM2.5 - Local Conditions     41860 San Francisco-Oakland-Hayward, CA
3 PM2.5 - Local Conditions     41860 San Francisco-Oakland-Hayward, CA
4 PM2.5 - Local Conditions     41860 San Francisco-Oakland-Hayward, CA
5 PM2.5 - Local Conditions     41860 San Francisco-Oakland-Hayward, CA
6 PM2.5 - Local Conditions     41860 San Francisco-Oakland-Hayward, CA
  STATE_CODE      STATE COUNTY_CODE  COUNTY SITE_LATITUDE SITE_LONGITUDE YEAR
1         06 California         001 Alameda      37.68753      -121.7842 2002
2         06 California         001 Alameda      37.68753      -121.7842 2002
3         06 California         001 Alameda      37.68753      -121.7842 2002
4         06 California         001 Alameda      37.68753      -121.7842 2002
5         06 California         001 Alameda      37.68753      -121.7842 2002
6         06 California         001 Alameda      37.68753      -121.7842 2002
  Month
1    01
2    01
3    01
4    01
5    01
6    01
# Bar plot for Mean_Variable by COUNTY
bar_plot <- ggplot(final_dataset1, aes(x = COUNTY, y = Mean_Variable)) +
  geom_bar(stat = "identity", fill = "skyblue") +
  labs(x = "COUNTY", y = "Mean Daily_Mean_PM2.5_Concentration") +
  ggtitle("Mean Daily PM2.5 Concentration by COUNTY") +
  theme_minimal() + theme(axis.text.x = element_text(angle = 45, hjust = 1))

print(bar_plot)

There are 51 counties in this data set, but California actually has 58. In the exploratory data analysis, this information can help with identifying why other counties were not excluded. From the bar plit, we can compare the mean daily PM2.5 concentration in rach county. We can see that the highest concentration amount is Kern and Kinds. We can see the overall trend for each county. However, I do think that there are better wats to visualize it. I think that becuase there are a lot of counties, it can be hard to visualize it. I think a data table with values would be the best way to see which is the highest and which is the lowest.

Los Angeles County Exploration :

#Los Angeles County 
los_angeles_dataset <-final_dataset %>% filter(COUNTY == "Los Angeles")
#unique(los_angeles_dataset[,3])

#By County 
los_angeles_dataset1 <- los_angeles_dataset %>%
  group_by(Site_Name)%>% summarize(
    Mean_Variable = mean(Daily_Mean_PM2.5_Concentration),
    Median_Variable = median(Daily_Mean_PM2.5_Concentration),
    Min_Var= min(Daily_Mean_PM2.5_Concentration),
    Max_Var= max(Daily_Mean_PM2.5_Concentration),
    Count = n())

head(los_angeles_dataset1)
# A tibble: 6 × 6
  Site_Name                 Mean_Variable Median_Variable Min_Var Max_Var Count
  <chr>                             <dbl>           <dbl>   <dbl>   <dbl> <int>
1 Azusa                             18.7             16.7     3.1    72.4   415
2 Burbank                           24.0             21.6     3.5    63     122
3 Compton                           13.0             11.9     2.6    54.6   723
4 Glendora                           8.42             7.8    -0.8    56     365
5 Lancaster-Division Street          8.14             7.7     0      24     459
6 Lebec                              4.44             4.2     0      12.4   150
#By Year
los_angeles_dataset2 <- los_angeles_dataset %>%
  group_by(YEAR)%>% summarize(
    Mean_Variable = mean(Daily_Mean_PM2.5_Concentration),
    Median_Variable = median(Daily_Mean_PM2.5_Concentration),
    Min_Var= min(Daily_Mean_PM2.5_Concentration),
    Max_Var= max(Daily_Mean_PM2.5_Concentration),
    Count = n()
  )

head(los_angeles_dataset2)
# A tibble: 2 × 6
  YEAR  Mean_Variable Median_Variable Min_Var Max_Var Count
  <chr>         <dbl>           <dbl>   <dbl>   <dbl> <int>
1 2002           19.7            17.4     0.6    72.4  1879
2 2022           11.0            10.4    -1.2    56    5134
head(los_angeles_dataset1)
# A tibble: 6 × 6
  Site_Name                 Mean_Variable Median_Variable Min_Var Max_Var Count
  <chr>                             <dbl>           <dbl>   <dbl>   <dbl> <int>
1 Azusa                             18.7             16.7     3.1    72.4   415
2 Burbank                           24.0             21.6     3.5    63     122
3 Compton                           13.0             11.9     2.6    54.6   723
4 Glendora                           8.42             7.8    -0.8    56     365
5 Lancaster-Division Street          8.14             7.7     0      24     459
6 Lebec                              4.44             4.2     0      12.4   150
# Box plot by year
boxplot(Daily_Mean_PM2.5_Concentration ~ YEAR, data = los_angeles_dataset, 
        main = "Daily Mean PM2.5 Concentrations in Los Angeles, grouped by year", 
        xlab = "County", 
        ylab = "PM2.5 Concentration")

# Bar plot for Mean_Variable by Site_Name
bar_plot <- ggplot(los_angeles_dataset1, aes(x = Site_Name, y = Mean_Variable)) +
  geom_bar(stat = "identity", fill = "skyblue") +
  labs(x = "Site Name", y = "Mean Daily_Mean_PM2.5_Concentration") +
  ggtitle("Mean Daily PM2.5 Concentration by Site Name") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
print(bar_plot)

los_angeles_dataset1
# A tibble: 18 × 6
   Site_Name                 Mean_Variable Median_Variable Min_Var Max_Var Count
   <chr>                             <dbl>           <dbl>   <dbl>   <dbl> <int>
 1 Azusa                             18.7             16.7     3.1    72.4   415
 2 Burbank                           24.0             21.6     3.5    63     122
 3 Compton                           13.0             11.9     2.6    54.6   723
 4 Glendora                           8.42             7.8    -0.8    56     365
 5 Lancaster-Division Street          8.14             7.7     0      24     459
 6 Lebec                              4.44             4.2     0      12.4   150
 7 Long Beach (North)                18.2             15.5     4.2    62.7   411
 8 Long Beach (South)                12.0             11.4     2.5    26.6   243
 9 Long Beach-Route 710 Nea…         13.4             12.5     2.2    41.7   625
10 Los Angeles-North Main S…         14.6             12.4     2.4    66.3  1276
11 Lynwood                           23.3             19.8     5.4    64     122
12 North Hollywood (NOHO)            13.0             13.1     1.7    43.4   365
13 Pasadena                          14.7             11.8     3.5    57.8   241
14 Pico Rivera #2                    11.4             10.0     3.2    53.8   118
15 Reseda                            12.3             11       0.5    48.8   602
16 Santa Clarita                      9.14             8.5     2.5    27.2   365
17 Signal Hill (LBSH)                 8.85             8.5    -1.2    28.8   293
18 <NA>                              23.9             21.4     5.6    61     118
# Box plot for Daily_Mean_PM2.5_Concentration by Site_Name
box_plot <- ggplot(los_angeles_dataset1, aes(x = Site_Name, y = Mean_Variable)) +
  geom_boxplot(fill = "lightgreen", color = "darkgreen") +
  labs(x = "Site Name", y = "Mean_Variable") +
  ggtitle("Distribution of Daily PM2.5 Concentration by Site Name") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
print(box_plot)

For Los Angeles, I made three different visualizations that give about the same information. We can see from the box in whiskers plots that the mean daily PM2.5 concentration in Los Angeles was higher in 2002 compared to 2022. With this information, we hypothesize why there is a difference between mean values between those two years (i.e Less cars? More electric cars? etc.) We can see that both data contain outliers. We also know that there spread is highers in 2002 (range of values). In the barplots, we can see that NA exists. If I were to data clean even more, I would not include the data where the PM2.5 concentrations are not inputted. We can see that Burbank, Lynwood, and NA (we will not mention this) have the highest mean concentration level of PM2.5. We can see that Lebec has the lowest. The same information can be found in the last plot (with green tics). However, I think the blue box plot is a lot better to read and is more visually pleasing.